One definition of philosophy is the study of general and fundamental questions. Besides reading, analyzing, and studying enormous volumes of philosophical texts, is there another (ideally faster) way to unlock some of the topics discussed in philosophy? The goal of this notebook is to show how data can offer a different way of looking at this question.
The dataset we will be working with can be found on Kaggle. It contains pre-processed sentences from various schools of philosophy. The notebook first shows summary visualizations of the dataset, and then uses topic modeling to extract insights into what philosophers write about.
This notebook was created using Python 3.6. In addition to common Python packages, the following packages are required:
It is also important to note that loading the pretrained model file requires sklearn version 0.20.3 (the same version used to build the model). The code version of this notebook includes the code to build the model, so loading the saved model with Python's pickle package is not required. Please note that model training takes a significant amount of time (about 1 hour on the entire dataset). Since the topic model is not guaranteed to converge to the optimal solution and can settle on a local optimum instead, a random seed was set in the model definition so that rerunning the code should reproduce the model seen in this notebook.
You can use the code below to install the packages if needed.
Here is a subset of what the raw data from the csv looks like. We have information on the title, author, and school of philosophy, as well as a pre-processed sentence and a tokenized version (a list of each word/token in the sentence).
| | title | author | school | sentence_lowered | tokenized_txt |
|---|---|---|---|---|---|
| 0 | Plato - Complete Works | Plato | plato | what's new, socrates, to make you leave your ... | ['what', 'new', 'socrates', 'to', 'make', 'you... |
| 1 | Plato - Complete Works | Plato | plato | surely you are not prosecuting anyone before t... | ['surely', 'you', 'are', 'not', 'prosecuting',... |
| 2 | Plato - Complete Works | Plato | plato | the athenians do not call this a prosecution b... | ['the', 'athenians', 'do', 'not', 'call', 'thi... |
| 3 | Plato - Complete Works | Plato | plato | what is this you say? | ['what', 'is', 'this', 'you', 'say'] |
| 4 | Plato - Complete Works | Plato | plato | someone must have indicted you, for you are no... | ['someone', 'must', 'have', 'indicted', 'you',... |
The above chart shows the total number of sentences in each philosophical school in descending order. Most of the sentences come from the schools Analytic, Aristotle, German Idealism, and Plato, while the schools Stoicism, Nietzsche, Communism, and Capitalism have the fewest sentences.
Next, let's look at the number of books/titles from each school that are contained in the dataset.
An interesting thing to note is that while Aristotle and Plato have the lowest number of titles (one each), they are both in the top 4 schools by total sentences in the dataset. In other words, a large portion of the sentences in the dataset can be attributed to two philosophical schools that each consist of one book and one author.
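The two summaries above (sentences per school and titles per school) could be computed along the following lines. This is a minimal sketch on a made-up DataFrame, not the notebook's actual code: only the column names `title` and `school` come from the table shown earlier, and the rows and book names are placeholders.

```python
import pandas as pd

# Placeholder rows standing in for the Kaggle CSV (assumption, not real data).
df = pd.DataFrame({
    "title": ["Book A", "Book A", "Book A", "Book B", "Book C", "Book C"],
    "school": ["plato", "plato", "plato", "analytic", "analytic", "analytic"],
})

# Sentences per school: one row per sentence, so counting rows is enough.
# value_counts() returns the counts sorted in descending order.
sentences_per_school = df["school"].value_counts()

# Titles per school: count distinct titles within each school.
titles_per_school = (
    df.groupby("school")["title"].nunique().sort_values(ascending=False)
)

print(sentences_per_school)
print(titles_per_school)
```

In this toy data, `plato` has a single title but as many sentences as `analytic`, which is the kind of imbalance described above.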
The dataset contains various sentences from different philosophical texts from different authors and schools of thought. From this dataset of approximately 360 thousand sentences, can we determine what type of things authors wrote about? What about schools of philosophy that discuss similar topics?
To look into these questions, we can fit a topic model over our dataset to discover themes in the text by identifying different topics discussed in these sentences. Topic modeling is an unsupervised method that clusters text data by identifying topics that occur across a collection of documents. In this case, each "document" is a sentence from a philosophical text in our dataset. This notebook generates an LDA (Latent Dirichlet Allocation) model to identify different topics discussed in philosophical texts. LDA assumes that each document is a mixture of different topics, and fitting the model to text data attempts to learn these topic distributions. From this, we can visualize overall concepts discussed across all schools of philosophy, and analyze which schools write about similar topics. For a more in-depth introduction to LDA, you can visit this website.
The first step is to extract features from every document that will then be used to build the topic model. There are various ways to convert text data into numerical features. In this notebook, each sentence becomes one row of a sparse matrix that holds the count of each word appearing in that sentence. Then we will use sklearn's implementation of LDA to create a topic model over the philosophy text data.
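The feature extraction and model fitting steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the example sentences, the stop-word setting, the choice of 2 topics, and the seed value 0 are all assumptions made to keep the example small and runnable (the notebook itself fits 8 topics on the full dataset).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the sentence_lowered column (assumption, not real data).
sentences = [
    "labor creates value and money measures price",
    "the price of labor depends on money and value",
    "the soul desires truth and knowledge",
    "knowledge of truth belongs to the soul",
]

# Bag-of-words features: one row per sentence, one column per vocabulary word.
vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(sentences)  # scipy sparse matrix

# A fixed random_state makes repeated fits reproducible.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row of doc_topics is the topic mixture of one sentence.
doc_topics = lda.fit_transform(word_counts)

print(word_counts.shape)
print(doc_topics.round(2))
```

Each row of `doc_topics` sums to 1, which is what allows it to be read as topic proportions for that sentence.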
Before diving into the results of the LDA model, let's take a look at the most common words that appear in the entire philosophical text dataset.
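One way to get such a word count is sketched below using only the standard library. It assumes the `tokenized_txt` column comes back from the csv as the string form of a Python list (as the table above suggests), and it uses a tiny hypothetical stop-word set; the notebook's actual stop-word handling is not shown in this text.

```python
import ast
from collections import Counter

# String-encoded token lists, as they might appear after reading the csv.
# The first entry mirrors a row from the table above; the rest are made up.
rows = [
    "['what', 'is', 'this', 'you', 'say']",
    "['man', 'is', 'by', 'nature', 'a', 'political', 'animal']",
    "['what', 'is', 'man']",
]

# Hypothetical stop-word set, just large enough for this example.
STOP_WORDS = {"what", "is", "this", "you", "a", "by"}

word_counts = Counter()
for raw in rows:
    tokens = ast.literal_eval(raw)  # safely parse the string into a list
    word_counts.update(t for t in tokens if t not in STOP_WORDS)

# The single most frequent remaining word and its count.
top_word = word_counts.most_common(1)
print(top_word)
```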
The most frequent words seem to align with what I think philosophers may discuss, from time to nature to the body. It is interesting to note that "things" and "man" are the most frequent words used across all of the text. Let's take a closer look at word frequency by the topics identified by our model.
Now that we have built the topic model, we can visualize the results using pyLDAvis, a python package that can be used to build an interactive visual of the topics detected by the LDA model. Some things to note about the display:
The graphic below is the result of 8 topics detected by the LDA model:
Note: Click on each bubble to see the most frequent words in the selected topic
Of the 8 topics displayed, there seem to be 4 distinct representations of different topics, and then a clustering of the remaining 4 that seem to be very similar and potentially overlap in ideas or concepts.
Starting at the top right, Topic 5 seems to be a distinct topic with no overlap with the others, and it sits quite far from the rest of the topics. The top words in this topic are "like", "change", "light", and "eyes", with other words like "earth", "sun", and "wind". This topic could potentially represent sentences that discuss the natural world and what people directly see. Moving on to the bottom right quadrant, Topic 4 has the top words "women", "people", "state", and "animals", and also frequently includes words like "law", "public", "war", and "society". Perhaps this topic discusses government, laws, and judgement. In a relatively similar topic, topic 1 includes words like "value", "labor", "money", and "price", most likely indicating economic concepts.
Let's move on to the topics that appear to be more similar to each other than the ones discussed so far. While it does not overlap with the remaining 4 topics, topic 6 seems to be closely related to them. This topic's top words include "world", "object", "consciousness", and "concept", in addition to words like "existence", "understanding", and "reality". To me, this topic makes sense in terms of philosophical texts. You could imagine philosophers trying to understand reality and the world around them, trying to answer questions like: what does it mean to exist?
The next four topics seem to be closely related to each other. Topic 8 includes words like "say", "know", "truth", and "question". This is another topic whose frequent words make sense in terms of philosophy: searching for the truth and also questioning it. Similarly, Topic 3 includes words like "language", "meaning", "word", and "truth", potentially trying to understand the meaning of the written word. Topic 2 includes words like "man", "nature", "love", "god", "soul", "desire", and "evil", potentially covering a broad range of topics including religion and the idea of desire and love. Finally, Topic 7 includes words like "body", "motion", "movement", "matter", and "form", potentially covering the discussion of the human body and physical forms.
The above was a brief overview of the topics created by our topic model. Let's see if we can get an understanding of the topics discussed in the different schools of philosophy.
Let's take a closer look at the topic distribution assigned to each sentence. The LDA model transforms each sentence into assigned topic proportions based on the words used. From this distribution, we can assign each sentence a dominant topic, or in other words the topic that has the highest contribution proportion based on the LDA transformation. After some cleaning and manipulation, here is what the topic distributions look like for the first five sentences in our dataset:
| | Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | Topic6 | Topic7 | Topic8 | dominant_topic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.01 | 0.01 | 0.49 | 0.27 | 0.01 | 0.01 | 0.19 | 4 |
| 1 | 0.03 | 0.03 | 0.03 | 0.82 | 0.03 | 0.03 | 0.03 | 0.03 | 4 |
| 2 | 0.03 | 0.03 | 0.03 | 0.62 | 0.03 | 0.03 | 0.03 | 0.22 | 4 |
| 3 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.56 | 8 |
| 4 | 0.03 | 0.46 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 | 0.39 | 2 |
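Assigning the dominant topic amounts to taking the position of the largest proportion in each row. The sketch below reuses the five rows from the table above (rounded, so they do not sum to exactly 1) and then counts sentences per dominant topic. The variable names are illustrative and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Topic proportions for the first five sentences, copied from the table above.
doc_topics = np.array([
    [0.01, 0.01, 0.01, 0.49, 0.27, 0.01, 0.01, 0.19],
    [0.03, 0.03, 0.03, 0.82, 0.03, 0.03, 0.03, 0.03],
    [0.03, 0.03, 0.03, 0.62, 0.03, 0.03, 0.03, 0.22],
    [0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.56],
    [0.03, 0.46, 0.03, 0.03, 0.03, 0.03, 0.03, 0.39],
])

columns = [f"Topic{i}" for i in range(1, 9)]
topic_df = pd.DataFrame(doc_topics, columns=columns)

# argmax gives a 0-based column position; add 1 to match the Topic1..Topic8 labels.
topic_df["dominant_topic"] = doc_topics.argmax(axis=1) + 1

# Number of sentences attributed to each dominant topic.
sentences_per_topic = topic_df["dominant_topic"].value_counts()

print(topic_df["dominant_topic"].tolist())
print(sentences_per_topic)
```

The resulting labels match the `dominant_topic` column of the table.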
Let's see how many sentences can be attributed to each topic.
The majority of sentences in our philosophy dataset are dominated by topics 8, 6, 2, and 7. These 4 topics are spatially close in our topic visualization, and to me seem to represent concepts common in the discussion of philosophy, so it makes sense a majority of sentences can be attributed to these topics.
An initial finding, while it may be obvious, is that the majority of sentences that correspond to topic 1 (which focuses on words like "money", "labor", and "price") come from Capitalism and Communism. Both Capitalism and Communism are political ideologies that focus on economics, which aligns with the most frequent words seen in topic 1. While they are very different ideologies, both schools seem to discuss the same subjects of production, price, and labor, offering different views on them.
Taking a closer look at the topic with the most sentences, topic 8 includes sentences from all schools, but is mostly dominated by Analytic, Plato, and Aristotle. This topic has words like "know", "think", "truth", and "question". I am no philosopher and have little knowledge of the intricacies of these texts, but we can try to understand some things at a high level.
Plato's book ranges over topics from Ethics to Forms to Epistemology, the branch of philosophy concerned with knowledge. Aristotle, another Greek philosopher and a student of Plato, also has a portion of sentences that fall under topic 8, and his work also discusses epistemology. The majority of sentences that fall into topic 8 are from the Analytic school of philosophy, which covers a broad range of concepts and ideas, with an emphasis on clarity and an insistence on explicit argument. A philosopher or an otherwise interested person could look at the sentences from these 3 schools grouped together in topic 8 and review what the schools potentially have in common, without having to read enormous volumes of text.
Now let's take a look at each school and the topics that are covered.
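A per-school topic distribution like this can be built as a row-normalized cross-tabulation of school against dominant topic. The sketch below uses made-up labels purely to show the mechanics; the proportions it prints are not results from the dataset, and the DataFrame name is an assumption.

```python
import pandas as pd

# Made-up school and dominant-topic labels (assumption, not real data).
labeled = pd.DataFrame({
    "school": ["capitalism"] * 4 + ["communism"] * 4 + ["plato"] * 2,
    "dominant_topic": [1, 1, 1, 4, 1, 1, 4, 4, 8, 2],
})

# normalize="index" turns each row into proportions that sum to 1,
# so schools with very different sentence counts can be compared.
topic_share_by_school = pd.crosstab(
    labeled["school"], labeled["dominant_topic"], normalize="index"
)

print(topic_share_by_school.round(2))
```

Normalizing by row matters here because the schools differ widely in how many sentences they contribute.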
Again, it is no surprise that sentences from Capitalism and Communism both mostly fall into the first topic. Below, you can see the most frequent words in sentences belonging to each school, and it is no surprise that there are a lot of similarities and overlap.
As previously mentioned, I am no philosopher and do not have much, if any, experience in literature reviews of these philosophical texts. From the topic distributions above, I can make a less obvious (at least to me) connection between German Idealism and Phenomenology. The majority of sentences from both of these schools fall under topic 6. Let's take a look at the most frequently used words in both of these schools of philosophy.
Again, there are a lot of similarities, including consciousness and being/existing. Referring to the first visuals of this notebook, Phenomenology has 5 titles from 3 different authors, and German Idealism has 7 titles, also from 3 different authors. One of the books from German Idealism is called "The Phenomenology Of Spirit". Are the two schools of thought similar in the topics discussed, or is the overlap perhaps due to one specific author's ideas?
All of the various texts by each author have over 40% of their sentences dominated by Topic 6. The majority of German Idealism titles (all but one) have over 50% of their sentences in topic 6. Interestingly, the titles from the Phenomenology author Husserl have 75% and 69% of the sentences dominated by topic 6. Perhaps concepts discussed in his texts significantly overlap with the ideas presented in German idealism.
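The per-title percentages discussed above are the share of each title's sentences whose dominant topic is 6. A minimal sketch of that calculation follows, using placeholder authors, titles, and labels rather than the real data.

```python
import pandas as pd

# Placeholder authors, titles, and topic labels (assumption, not real data).
labeled = pd.DataFrame({
    "school": ["phenomenology"] * 4 + ["german_idealism"] * 4,
    "author": ["Author X"] * 4 + ["Author Y"] * 4,
    "title": ["Title A"] * 4 + ["Title B"] * 2 + ["Title C"] * 2,
    "dominant_topic": [6, 6, 6, 8, 6, 2, 6, 6],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of sentences in each title dominated by topic 6.
topic6_share = (
    labeled.assign(is_topic6=labeled["dominant_topic"].eq(6))
    .groupby(["school", "author", "title"])["is_topic6"]
    .mean()
)

print(topic6_share)
```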
While Husserl's texts have a higher proportion of sentences that can be attributed to topic 6, the other two Phenomenology authors still have a large proportion of sentences that fall under the topic. I therefore still think it can be concluded that there is an overlap in content and concepts between the schools of German Idealism and Phenomenology, with perhaps more similarities between Husserl and the authors that fall into the school of German Idealism.
Through the use of topic modeling, we were able to identify different concepts and topics presented in various philosophical texts from 13 different schools of philosophy and a total of 36 different authors. Utilizing sklearn's implementation of LDA, we were able to identify 8 topics and show the most frequent words from each. Unsurprisingly, common words were things like "consciousness", "existence", "truth", and "self", all things one can imagine a philosopher contemplates on a day-to-day basis. We saw four distinct topics in the model visualization, as well as four topics that overlapped in similarity. These four seem to capture general ideas, so it perhaps makes sense that their content overlaps.
We then dove into topic distributions by school, seeing unsurprisingly that Capitalism and Communism significantly overlap in topic 1, which focuses on economic concepts and ideas. Perhaps less obvious, at least to me, was the overlap between German Idealism (60%) and Phenomenology (49%), both of which have a significant proportion of sentences falling under topic 6. Was this due to overlapping ideas from one author, or one specific text? While Husserl from the school of Phenomenology has a higher proportion of his sentences falling under topic 6 than his peers in Phenomenology, there still seems to be evidence of an overlap in concepts across both schools of philosophy, rather than an overlap limited to a subset of authors or texts.
There are many more comparisons that can be made using the analysis and visuals in this notebook. This notebook is an example of how data analysis and machine learning techniques can be used to speed up and assist in the literature review of these philosophical texts.